Let’s get started with this course. Right now, I’m feeling rather good – I submitted my first article earlier today. Here’s hoping it doesn’t bounce back immediately.
What I expect from this course:
Improve my R skills – I’ve played around with it a few times, but Python is more my thing.
Remind myself of git & GitHub functions.
Improve my statistical analysis skills.
I heard about this course via an email.
I thought the intro with Health Data Science worked quite nicely, though there is a lot of content to go through.
What I’m not yet sold on is R’s syntax. There seem to be many ways and libraries to do the same thing, and for someone moving from Python and Pandas, none of them feels as intuitive. Of course, this is at least partly a matter of taste and habit.
Moving on:
# This is a so-called "R chunk" where you can write R code.
date <- format(Sys.Date(), format="%d %m %Y")
cat('hello world on', date)
## hello world on 12 12 2023
Find my repo here: https://github.com/tadusko/IODS-project
Or just here
This week started our journey into regression analysis with simple and multiple linear models. We also validate the models graphically.
This week, we’ll be working on survey data to understand the connections between students’ attitudes and motivations and their performance in an exam. The data is from 2014 and was collected by Kimmo Vehkalahti. A full metadata description is provided here.
Since the original data includes tens of variables, the data has been preprocessed by collapsing the answers to many Likert-scaled claims to single answers by theme. For example, the variable “stra” represents the mean of answers related to “Study strategy”.
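The collapsing idea can be sketched in a few lines (the item names below are made up; the real preprocessing used the actual questionnaire columns):

```r
# Hypothetical Likert items for one theme; the real data had many more questions.
likert <- data.frame(
  strategy_q1 = c(4, 2, 5),
  strategy_q2 = c(3, 3, 4),
  strategy_q3 = c(5, 2, 3)
)
# a "stra"-style theme score: row-wise mean over the theme's items
likert$stra <- rowMeans(likert[, c("strategy_q1", "strategy_q2", "strategy_q3")])
likert$stra  # 4.000000 2.333333 4.000000
```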
Let’s read in the data and explore its structure:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
df <- read_csv("data/learning2014.csv", show_col_types = FALSE)
str(df)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. age = col_double(),
## .. attitude = col_double(),
## .. deep = col_double(),
## .. stra = col_double(),
## .. surf = col_double(),
## .. points = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
dim(df)
## [1] 166 7
Cool! There are 166 observations (rows) and 7 variables (columns). The variables are all numerical save for gender, which is a binary character (F/M). points records each student’s exam performance and is our target variable.
Let’s start with a more thorough exploration of the variables. It’s important to know the distribution of the variables and their interconnections. As a reminder, the task is to predict exam performance from survey results that try to capture the study strategies and attitudes of students.
Let’s therefore create a graphical representation of the variables and a summary statistics table.
# calling the GGally and ggplot2 libraries
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# creating a plot matrix to explore the distribution of the variables and the relationships between them.
p <- ggpairs(df, mapping = aes(), title="Relationships between the variables", progress = FALSE, lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
# summary stats of the numerical columns
summary(dplyr::select(df, -c(gender)))
## age attitude deep stra
## Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
# counts per gender
table(df$gender)
##
## F M
## 110 56
Starting with the background variables (age & gender), the participants skew young (median: 22, mean: 26). There are about twice as many women as men in the data.
Looking at the histograms, many of the survey variables (attitude, deep, stra, surf) are approximately normally distributed. The variables are mostly uncorrelated with each other.
The rightmost column shows correlations between exam points and the background/survey variables. Only one stands out: attitude towards statistics is positively correlated with exam performance (Pearson correlation coefficient: 0.437). The strategic-learning and surface-learning scores yield the next-highest correlations (0.146 and -0.144, respectively).
Based on the correlation coefficients, let’s build a regression model with three explanatory variables: attitude towards statistics (attitude), strategic learning questions (stra) and surface learning questions (surf).
# create a regression model with multiple explanatory variables
multi_model <- lm(points ~ attitude + stra + surf, data = df)
# summarize model
summary(multi_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
This model does not work particularly well: multiple R-squared is only 0.2074 and adjusted 0.1927. R-squared values basically quantify how much of the variation in point outcomes is captured by the explanatory variables.
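To make that concrete, R-squared can be computed by hand as 1 − RSS/TSS. A small synthetic sketch (not the course data):

```r
# R-squared by hand on synthetic data
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)           # residual sum of squares
tss <- sum((y - mean(y))^2)            # total sum of squares
r2  <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)  # TRUE
```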
Similarly to the previous correlations, only attitude is statistically significant. Let’s therefore fit another model without the non-significant variables.
# create a regression model with the significant variable
model <- lm(points ~ attitude, data = df)
# summarize model
summary(model)
##
## Call:
## lm(formula = points ~ attitude, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
While multiple R-squared is slightly lower, this is not a cause for concern. After all, adding even spurious variables to a model will explain some of the noise, so multiple R-squared never decreases when variables are added.
Multiple R-squared for a model with a single explanatory variable is simply the square of the correlation coefficient (0.437² ≈ 0.191).
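This identity is easy to verify on synthetic data (any single-predictor lm will do):

```r
# for simple linear regression, R-squared equals the squared Pearson correlation
set.seed(2)
x <- rnorm(40)
y <- x + rnorm(40)
fit <- lm(y ~ x)
all.equal(summary(fit)$r.squared, cor(x, y)^2)  # TRUE
```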
In this model, the explanatory variable is significant. Let’s continue using it.
Finally, we ought to make sure the model does not violate the assumptions of normality and constant variance.
For this purpose, let’s plot the residuals against the fitted values, the residuals in a Quantile–Quantile plot, and Residuals vs Leverage.
par(mfrow = c(2,2))
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
plot(model, which=c(1,2,5))
Starting from Residuals vs Fitted, we are looking for a roughly even scatter of residuals across the whole range of fitted values. While a few outliers exist, it looks good overall. The constant variance assumption is not violated.
The points in the Q-Q plot should form an approximately straight line. Again, that is achieved: the residuals follow roughly a normal distribution.
Leverage tells us how much influence individual observations have on the model fit. I don’t see anything too alarming.
Summary: the model does not violate any underlying assumptions. However, it does not explain the phenomenon (exam performance) that well, either.
This week we dive into logistic regression. It’s an approach that can be applied to a binary response variable. Since the previously used linear regression models carry certain assumptions, such as normality, we cannot use them here. Instead, we’ll rely on Generalized Linear Models (GLM). The output of the model is the probability that the response variable is true.
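The mapping from the model’s linear predictor (log-odds) to a probability is the inverse logit, available in base R as plogis():

```r
# the linear predictor lives on the log-odds scale; plogis() is the inverse logit
eta <- c(-2, 0, 2)       # example linear-predictor values
round(plogis(eta), 2)    # 0.12 0.50 0.88
```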
What are GLMs? Let’s work through an exercise to find out. The task is to examine high alcohol consumption and its interplay with various socioeconomic variables. A successful model will be able to predict high alcohol consumption based on other variables tied to the individual.
We use a dataset of Portuguese secondary school students collected by combining school records and questionnaires. The raw data was created by Paulo Cortez and is available at this link under CC BY 4.0. Other metadata can also be found at the link.
I have preprocessed the data by joining the two distinct datasets in the original file (math and Portuguese performance). In addition, I have created a new column (alc_use) that represents the mean alcohol use of students on a five-point scale from very low to very high. From that, I’ve derived a binary variable (high_use), which is true if the mean alcohol use is over 2.
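A toy version of that derivation (presumably averaging the workday and weekend columns Dalc and Walc; the mini data frame here is made up):

```r
# toy data frame with the alcohol-use columns found in the real data
d <- data.frame(Dalc = c(1, 3, 2), Walc = c(1, 4, 2))
d$alc_use  <- (d$Dalc + d$Walc) / 2  # mean of workday and weekend use
d$high_use <- d$alc_use > 2          # TRUE when mean use exceeds 2
d
```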
library(dplyr)
library(readr)
alc <- read_csv("data/alc.csv", show_col_types = FALSE)
str(alc)
## spc_tbl_ [370 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ school : chr [1:370] "GP" "GP" "GP" "GP" ...
## $ sex : chr [1:370] "F" "F" "F" "F" ...
## $ age : num [1:370] 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr [1:370] "U" "U" "U" "U" ...
## $ famsize : chr [1:370] "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr [1:370] "A" "T" "T" "T" ...
## $ Medu : num [1:370] 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : num [1:370] 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr [1:370] "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr [1:370] "teacher" "other" "other" "services" ...
## $ reason : chr [1:370] "course" "course" "other" "home" ...
## $ guardian : chr [1:370] "mother" "father" "mother" "mother" ...
## $ traveltime: num [1:370] 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : num [1:370] 2 2 2 3 2 2 2 2 2 2 ...
## $ schoolsup : chr [1:370] "yes" "no" "yes" "no" ...
## $ famsup : chr [1:370] "no" "yes" "no" "yes" ...
## $ activities: chr [1:370] "no" "no" "no" "yes" ...
## $ nursery : chr [1:370] "yes" "no" "yes" "yes" ...
## $ higher : chr [1:370] "yes" "yes" "yes" "yes" ...
## $ internet : chr [1:370] "no" "yes" "yes" "yes" ...
## $ romantic : chr [1:370] "no" "no" "no" "yes" ...
## $ famrel : num [1:370] 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : num [1:370] 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : num [1:370] 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : num [1:370] 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : num [1:370] 1 1 3 1 2 2 1 1 1 1 ...
## $ health : num [1:370] 3 3 3 5 5 5 3 1 1 5 ...
## $ failures : num [1:370] 0 0 2 0 0 0 0 0 0 0 ...
## $ paid : chr [1:370] "no" "no" "yes" "yes" ...
## $ absences : num [1:370] 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : num [1:370] 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : num [1:370] 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : num [1:370] 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num [1:370] 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi [1:370] FALSE FALSE TRUE FALSE FALSE FALSE ...
## - attr(*, "spec")=
## .. cols(
## .. school = col_character(),
## .. sex = col_character(),
## .. age = col_double(),
## .. address = col_character(),
## .. famsize = col_character(),
## .. Pstatus = col_character(),
## .. Medu = col_double(),
## .. Fedu = col_double(),
## .. Mjob = col_character(),
## .. Fjob = col_character(),
## .. reason = col_character(),
## .. guardian = col_character(),
## .. traveltime = col_double(),
## .. studytime = col_double(),
## .. schoolsup = col_character(),
## .. famsup = col_character(),
## .. activities = col_character(),
## .. nursery = col_character(),
## .. higher = col_character(),
## .. internet = col_character(),
## .. romantic = col_character(),
## .. famrel = col_double(),
## .. freetime = col_double(),
## .. goout = col_double(),
## .. Dalc = col_double(),
## .. Walc = col_double(),
## .. health = col_double(),
## .. failures = col_double(),
## .. paid = col_character(),
## .. absences = col_double(),
## .. G1 = col_double(),
## .. G2 = col_double(),
## .. G3 = col_double(),
## .. alc_use = col_double(),
## .. high_use = col_logical()
## .. )
## - attr(*, "problems")=<externalptr>
dim(alc)
## [1] 370 35
There are 370 observations and 35 variables – an extensive set of socioeconomic background variables! Check out the dataset’s metadata for more info on the variables.
Let’s pick four variables of interest and examine their connection to alcohol consumption. Below are my picks and my hypotheses on their connection to alcohol:
- sex: men consume more alcohol than women
- traveltime: a longer commute to school is linked to higher consumption
- studytime: more time spent studying is linked to lower consumption
- goout: more outgoing students consume more alcohol
Let’s look into the variables and their relationship with high alcohol consumption in detail to see if the hypotheses hold any water!
First, some simple bar plots. The first plot shows the distribution of each variable. Then each plot is split in two, showing the distribution of that variable when high_use is TRUE or FALSE.
library(ggplot2)
library(tidyr)
library(purrr)
# extracting only the variables of interest to a new dataframe
extr <- alc[,c("sex", "traveltime", "studytime", "goout")]
# histograms
gather(extr) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
# printing out multiple plots was way harder than it should be
# anyway, this should do
walk(names(extr), ~{
g <- ggplot(data = alc, aes(x = !!sym(.)))
g <- g + geom_bar() + facet_wrap(~high_use)
print(g)
})
The plots show us at least that men are more prevalent among the high users. In addition, the distribution of goout differs noticeably between high users and others. For the other variables, however, bar plots may not be the best approach.
Let’s therefore group by the binary variables (sex, high_use) and calculate the mean of each numerical variable within the subgroups.
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_travel = mean(traveltime), mean_study = mean(studytime), mean_out = mean(goout))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 6
## # Groups: sex [2]
## sex high_use count mean_travel mean_study mean_out
## <chr> <lgl> <int> <dbl> <dbl> <dbl>
## 1 F FALSE 154 1.38 2.34 2.95
## 2 F TRUE 41 1.46 2 3.39
## 3 M FALSE 105 1.38 1.90 2.70
## 4 M TRUE 70 1.67 1.63 3.93
Interesting! We see evidence to support all of the hypotheses – male sex, long travel time, low study time and an outgoing personality indicate high use. However, the differences in mean travel and study times are quite small. A statistical test would be needed to say anything with more confidence.
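Such a test could look like the sketch below – a two-sample t-test on synthetic values standing in for the real columns:

```r
# synthetic stand-ins for a numeric variable in the two groups (not the real data)
set.seed(3)
non_high <- rnorm(30, mean = 1.4, sd = 0.5)  # e.g. travel times, non-high users
high     <- rnorm(30, mean = 1.7, sd = 0.5)  # e.g. travel times, high users
# Welch two-sample t-test on the group means
t.test(non_high, high)$p.value
```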
Now, to make a GLM with the variables we have. As a reminder, the model tries to predict a binary value: whether alcohol use is high or not.
# creating the model
m <- glm(high_use ~ goout + traveltime + studytime + sex, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ goout + traveltime + studytime + sex,
## family = "binomial", data = alc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.2372 0.6459 -5.012 5.39e-07 ***
## goout 0.7455 0.1200 6.212 5.24e-10 ***
## traveltime 0.3422 0.1795 1.907 0.05657 .
## studytime -0.4703 0.1704 -2.760 0.00579 **
## sexM 0.7021 0.2639 2.660 0.00780 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 376.10 on 365 degrees of freedom
## AIC: 386.1
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept) goout traveltime studytime sexM
## -3.2371999 0.7454873 0.3422454 -0.4702584 0.7021031
Other than travel time, all the variables are significant (p < 0.05). They affect the model in different ways: for example, an increase in study time decreases the probability of high use, whereas increasing outgoingness has the opposite effect.
Next, let’s interpret the model coefficients as odds ratios and calculate confidence intervals for them.
# find the model with glm()
m <- glm(high_use ~ goout + traveltime + studytime + sex, data = alc, family = "binomial")
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.03927371 0.01061421 0.1346865
## goout 2.10746808 1.67636472 2.6864367
## traveltime 1.40810576 0.99222702 2.0108276
## studytime 0.62484081 0.44291107 0.8658021
## sexM 2.01799228 1.20598329 3.4012264
Again, we see that travel time is not significant: its confidence interval includes 1, so we cannot conclude that travel time affects the odds of high use.
Other than that, consider for example the effect of male sex: the odds of being a high alcohol user are roughly doubled for males (OR: 2.02, CI: 1.21–3.40).
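As a cross-check, a rough Wald-type interval for traveltime can be computed from the summary’s estimate and standard error; it differs slightly from confint()’s profile-likelihood interval but tells the same story:

```r
# Wald-type approximation of the odds-ratio CI: exp(b ± 1.96 * se)
b  <- 0.3422   # traveltime estimate from the model summary
se <- 0.1795   # its standard error
round(exp(b + c(-1.96, 1.96) * se), 2)  # 0.99 2.00 (straddles 1)
```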
Finally, let’s see if the model does what we claimed it does. For that, let’s use it to predict the outcome for each student and then make a confusion matrix.
# let's drop the insignificant variable
m <- glm(high_use ~ goout + studytime + sex, data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
print(table(high_use = alc$high_use, prediction = alc$prediction))
## prediction
## high_use FALSE TRUE
## FALSE 234 25
## TRUE 58 53
# tabulate the target variable versus the predictions; probabilities
print(table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins())
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.63243243 0.06756757 0.70000000
## TRUE 0.15675676 0.14324324 0.30000000
## Sum 0.78918919 0.21081081 1.00000000
Hmm, not bad, but the model seems to make an awful lot of false negative predictions (n = 58, 16%).
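The overall accuracy and error rate can be read straight off the confusion matrix; the error here matches the training error the loss function reports later on:

```r
# the confusion matrix from above, rebuilt by hand
cm <- matrix(c(234, 58, 25, 53), nrow = 2,
             dimnames = list(high_use = c("FALSE", "TRUE"),
                             prediction = c("FALSE", "TRUE")))
accuracy <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
error    <- 1 - accuracy
round(c(accuracy = accuracy, error = error), 4)  # 0.7757 0.2243
```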
The model should at least do better than naïve guessing. Let’s check this by computing a loss function for the predictions and comparing it with the error rates obtained by labelling every single case FALSE or TRUE.
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  # a prediction counts as wrong when the predicted probability falls on the
  # wrong side of 0.5 relative to the true class (FALSE/TRUE, i.e. 0/1)
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2243243
loss_func(class = alc$high_use, prob = 0)
## [1] 0.3
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7
Assuming everyone is a non-high user gives an error rate of 0.3 – compared to the error rate of ~0.23 of the model. Therefore, the model at least outperforms a naïve baseline.
K-fold cross-validation is a validation method in which the input data is split into training and test sets that are rotated. The aim is to describe model performance in a way that is less susceptible to overfitting and randomness than training and testing on the whole dataset.
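The fold rotation can be sketched manually (indices only; cv.glm does this internally):

```r
# manual sketch of a 10-fold split for a dataset of 370 rows
set.seed(42)
n <- 370
folds <- sample(rep(1:10, length.out = n))  # randomly assign each row a fold
table(folds)                                # 37 rows per fold
# round k: train on rows with folds != k, test on rows with folds == k
```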
library(boot)
# doing the cross-validation – K = number of rows = 370, i.e. leave-one-out CV
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2648649
The average error rate of ~0.26 is a bit worse than the training error rate of ~0.23, and no better than the error rate of the model explored in the exercises this week.
This week, we explored logistic regression, odds ratios and (cross-)validation. Working through an example dataset showed how a binary variable (high alcohol consumption or not) can be predicted with a GLM using other variables. The performance of the model could then be quantified and validated. The model outdid a naïve baseline but could probably be fine-tuned with more fitting variables. Three of the four hypotheses I made previously held up quite nicely! Only travel time, which could indicate a great many things, was not useful for prediction.
It’s common to examine data through premade socioeconomic and societal classes that are imposed on the data – male and female, young and old, etc. Should we be bound to these, or could we let suitable classes emerge from the features of the data themselves?
This week, we’ll delve into methods for creating clusters of observations that are alike in some (meaningful) way. Once the clusters are created, they may be labelled and new observations classified into them.
The dataset we’ll be using is provided through the R package ‘MASS’. The ‘Boston’ dataset consists of various descriptive values for Boston, Massachusetts, such as the per-capita crime rate, nitrogen oxide concentrations and apartment values. Full metadata can be found here.
Let’s peek at the data.
suppressMessages(library(MASS))
suppressMessages(library(dplyr))
# load the data
bost <- Boston
# explore the dataset
str(bost)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(bost)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# plot matrix of the variables
#pairs(bost)
The dataframe has 506 rows and 14 columns. The variable types differ quite a bit: some are counts, others are, e.g., population-normalized rates, and one (‘chas’) is a binary dummy variable.
The multitude of variables makes simple interpretations rather tough. Let’s check out a mix of histograms and scatter plots anyway.
# needed for histograms
# from pairs plot documentation: https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/pairs.html
panel.hist <- function(x, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
# plot matrix of the variables
pairs(Boston, cex = 0.05, diag.panel=panel.hist, upper.panel=NULL)
## Warning in par(usr): argument 1 does not name a graphical parameter
For example, the crime rate (‘crim’) and residential land (‘zn’) variables are heavily right-skewed: most values are low, with a long tail towards high values. Others are closer to normal (‘lstat’) or have other distributions (‘rad’).
Some of them seem to correlate, but for a proper overview, let’s print out a correlation matrix and a correlation plot.
cor_matrix <- cor(Boston) %>% round(digits=2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# print the correlation matrix
# visualize the correlation matrix
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b",tl.pos = "d",tl.cex = 0.6)
Crime rate correlates positively with, e.g., the property tax rate (‘tax’) and negatively with, e.g., the proportion of Black population. In general, many of the variables are correlated with each other, which is not surprising, considering the interlinkages of, for example, property values and crime.
As noted above, the variables are on very different scales and not directly comparable. Let’s use R’s scaling functionality to make them so.
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
Alright! All the variables are now centered around 0. Most standardized values fall roughly between −4 and 4, though a few outliers (e.g. the maximum of ‘crim’, ≈ 9.9) reach further.
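For reference, scale() on a numeric vector is just centering by the mean and dividing by the standard deviation:

```r
# scale() is (x - mean(x)) / sd(x), column by column
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), manual)  # TRUE
```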
We will try to examine crime through the lens of four classes (low, medium low, medium high and high). For that, we’ll need to create a factor variable from a continuous one – in this case by cutting the crime variable at its quartiles, which yields four bins with roughly equal numbers of observations.
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, label=c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
A random 80% of the data is used for fitting a linear discriminant model; the rest is set aside for testing.
# number of rows in the Boston dataset
n <- nrow(Boston)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
The point of linear discriminant analysis is to classify observations into classes based on continuous variables. In this case, we’ll use all the other variables to find combinations of variables that discern the crime rate classes.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2450495 0.2500000 0.2673267 0.2376238
##
## Group means:
## zn indus chas nox rm age
## low 0.92705566 -0.8818215 -0.1132543107 -0.8635030 0.45389024 -0.8460845
## med_low -0.09731296 -0.2370698 0.0005392655 -0.5415579 -0.20307488 -0.2870049
## med_high -0.39116364 0.1709877 0.2380357777 0.3883352 0.09387503 0.4206507
## high -0.48724019 1.0172418 -0.0672717637 1.0680511 -0.32624900 0.8219728
## dis rad tax ptratio black lstat
## low 0.7981639 -0.6906942 -0.7328948 -0.36251560 0.37870307 -0.76823208
## med_low 0.3552537 -0.5406779 -0.4317851 -0.00415666 0.31188928 -0.07355229
## med_high -0.3817869 -0.4395396 -0.3371943 -0.24505633 0.09441777 0.04917981
## high -0.8359794 1.6368728 1.5131579 0.77931510 -0.74246795 0.82763129
## medv
## low 0.51163687
## med_low -0.08678306
## med_high 0.15600414
## high -0.68041051
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.06632181 0.71862781 -0.93792894
## indus 0.06469893 -0.14794000 0.31825327
## chas -0.08622263 -0.09107420 0.08209288
## nox 0.35390064 -0.83292676 -1.27333788
## rm -0.10946286 -0.08091803 -0.26024909
## age 0.22682860 -0.33101976 -0.06527398
## dis 0.00422357 -0.31078724 0.31556412
## rad 3.39693230 0.89239906 -0.15364184
## tax -0.01305008 0.05928923 0.68566175
## ptratio 0.11620785 -0.08282892 -0.36860268
## black -0.09311362 0.03985096 0.13444049
## lstat 0.23673915 -0.35368578 0.30419019
## medv 0.21067597 -0.55444772 -0.21217758
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9465 0.0397 0.0138
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
The biplot shows two clear clusters once the data is projected onto the first two discriminants. The one on the right seems to consist mostly of high-class observations – but with a lot of noise, too.
Let’s formally test the model.
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 16 9 3 0
## med_low 9 13 3 0
## med_high 0 2 14 2
## high 0 0 0 31
The majority of predictions are correct for each class. More extreme values (low and high) are seemingly easier to classify than middle-ground values.
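The class-wise claim can be quantified from the cross tabulation: the diagonal holds the correct predictions, so dividing it by the row sums gives per-class accuracy. A sketch using the counts printed above (note that the unseeded split means these exact counts vary between runs):

```r
# confusion matrix copied from the printout above (rows = correct, cols = predicted)
cm <- matrix(c(16,  9,  3,  0,
                9, 13,  3,  0,
                0,  2, 14,  2,
                0,  0,  0, 31),
             nrow = 4, byrow = TRUE,
             dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                             predicted = c("low", "med_low", "med_high", "high")))
round(diag(cm) / rowSums(cm), 2)  # per-class accuracy: low 0.57, med_low 0.52, med_high 0.78, high 1.00
sum(diag(cm)) / sum(cm)           # overall accuracy, about 0.73
```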
After scaling the raw values again, distance matrices are built – below, two distance metrics (Euclidean and Manhattan) are used for the task. They give mildly differing distance values, but let’s use the default Euclidean measure.
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method='manhattan')
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
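The difference between the two metrics is easiest to see on a two-point toy example: Euclidean distance is the straight line between the points, Manhattan the sum of coordinate-wise differences.

```r
# two points in 2D space
m <- rbind(p = c(0, 0), q = c(3, 4))
dist(m)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(m, method = "manhattan")  # Manhattan: |3 - 0| + |4 - 0| = 7
```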
On to clustering. The k-means algorithm needs to be told how many clusters to create. Let’s start off with an arbitrary 4.
library(ggplot2)
set.seed(42)
# k-means clustering
km <- kmeans(boston_scaled, centers = 4)
# plot the Boston dataset with clusters
# split into two figures for readability
pairs(boston_scaled[1:7],cex=0.5, col = km$cluster)
pairs(boston_scaled[7:14],cex=0.5, col = km$cluster)
To make the analysis more robust, we’ll test the inner cohesion of the clusters with differing numbers of centers. Seemingly, 2 is optimal, since the most radical change in the total within-cluster sum of squares (WCSS) happens at that point.
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
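The “most radical change” can also be checked numerically from the successive WCSS drops – a self-contained sketch (with `nstart` added to stabilize the otherwise random initialization, which the chunk above doesn’t use):

```r
library(MASS)  # for the Boston data
boston_scaled <- as.data.frame(scale(Boston))
set.seed(42)   # arbitrary seed
k_max <- 10
twcss <- sapply(1:k_max, function(k) kmeans(boston_scaled, k, nstart = 10)$tot.withinss)
drops <- -diff(twcss)   # how much WCSS falls when adding one more cluster
which.max(drops) + 1    # k right after the single largest drop; here 2
```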
# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled,cex=0.5, col = km$cluster, upper.panel=NULL)
# split into two for readability
pairs(boston_scaled[1:7],cex=0.5, col = km$cluster, upper.panel=NULL)
pairs(boston_scaled[7:14],cex=0.5, col = km$cluster, upper.panel=NULL)
With the scaled data, many of the variables seem to form horizontal and vertical “lines” at certain values. I wonder how much that affects clustering. Anyway, visual examination shows that the two clusters seem mostly sensible across the variables.
Data scarcity has a twin brother, data deluge. Sometimes we have multiple datasets to choose from, or many variables. These variables may in addition be linked to each other – for example, educational level attained and income are correlated.
This week, we’ll examine exploratory methods created for distilling the relevant information of many variables into a few. These dimensionality reduction techniques, such as principal component analysis, return fewer columns that try to capture the variation in the original variables.
This week’s data has been fused from two datasets with socioeconomic variables describing countries: the human development index (HDI) and the gender inequality index (GII). The data is offered by the United Nations Development Programme (UNDP). See the full metadata description here and refer to the data wrangling script for further information on how the data has been modified and what the column names refer to.
Let’s read in the data and set country names as index.
library(tibble)
library(GGally)
library(readr)
library(dplyr)
human <- read_csv("data/human.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Edu2.FM, Labo.FM, Life.Exp, Edu.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
human <- column_to_rownames(human, "Country")
Good. Then some data exploration.
summary(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
ggpairs(
human, progress=F,
upper = list(continuous = wrap("density", alpha = 0.5), combo = "box"),
lower = list(continuous = wrap("points", alpha = 0.3, size=0.1),
combo = wrap("dot", alpha = 0.4, size=0.2))
)
Educational variables and the share of female parliamentary representation are roughly normally distributed, whereas the rest are skewed left (life expectancy) or right (e.g., gross national income (GNI), adolescent birth rate). Many of the variables seem to be correlated based on the scatter plots, but let’s confirm this hunch with a correlation plot.
# Access corrplot
library(corrplot)
# mark insignificant correlations
testRes = cor.mtest(human, conf.level = 0.95)
# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot(p.mat = testRes$p, sig.level = 0.05, addCoef.col ='black', number.cex = 0.75, insig='blank', type = 'lower')
The educational variables are highly correlated with each other, with life expectancy and GNI (positively), and with maternal mortality and adolescent birth rate (negatively). Expectedly, lower maternal mortality and fewer adolescent births are related to wealthier countries (higher GNI). Female labor participation and parliamentary representation are only weakly, if at all, correlated with the other variables.
Let’s get on with the PCA. Below is an example of how not to do it. With unstandardized data, PCA is dominated by the variables with the largest variances. In this data, that is GNI, measured in dollars per capita – much larger values than the percentage shares or years of the other variables.
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2,cex = c(0.5, 1.2), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
# create and print out a summary of pca_human
s <- summary(pca_human)
print(s)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
# rounded percentages of variance captured by each PC
pca_pr <- round(s$importance[2, ] * 100, digits = 1)
# print out the percentages of variance
print(pca_pr)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
As a consequence of the dominance of GNI, the first principal component mostly describes the variation of GNI, not of the other variables. Standardization is needed.
The values are scaled to be centered around 0 – first, the mean of each variable is subtracted from each of its values, then each value is divided by the standard deviation of that variable.
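This is exactly the column-wise transformation that `scale()` performs; a quick sketch verifying it by hand on a toy vector:

```r
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)          # subtract the mean, divide by the sd
all.equal(as.numeric(scale(x)), manual)  # TRUE
```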
# standardize the variables
human_std <- scale(human)
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)
# create and print out a summary of pca_human
s <- summary(pca_human)
print(s)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
# rounded percentages of variance captured by each PC
pca_pr <- round(s$importance[2, ] * 100, digits = 1)
# print out the percentages of variance
print(pca_pr)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2,cex = c(0.5, 1.0), col = c("grey40", "deeppink2"))
Standardization makes all the difference for the reasons described above. Now we see that the correlated variables are close to each other on the x axis (education, income, life expectancy) or opposite, reflecting negative correlation (maternal mortality and adolescent birth rate). The two variables that were less related to the others load on a component of their own.
Although PC interpretation is often not useful – it’s educated guesswork, in essence – let’s try to see how we could interpret the first two principal components and their scatterplot.
Because the first PC seems to be broadly related to people’s health and income levels (positive or negative), I’ll name it “Health and wellbeing”. The second PC gets the label “Female societal participation” after the variables describing parliamentary share and labour participation.
descriptions <- c("PC1: Health and wellbeing", "PC2: Female societal participation")
pc_lab <- paste0(descriptions, " (", pca_pr[1:2], "%)")
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2,cex = c(0.5, 1.0), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
Interpreted through these labels, high-wellbeing nations (such as the Nordic countries) cluster to the left and top (high health and female societal participation). On the other hand, countries with low female societal status, such as Syria and Iran, are placed near the bottom of the plot. Developing countries with weaker healthcare systems and societal safety nets – and thus high maternal mortality and many adolescents giving birth – are placed towards the right of the plot.
Now for something completely different. We’ll examine multiple correspondence analysis (MCA), a method related to PCA but applicable to categorical data.
The example data originates from the FactoMineR package – it contains 300 questionnaire answers related to tea. 18 questions relate to people’s tea drinking habits, 12 to perception of the product and 6 are background questions.
library(FactoMineR)
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
dim(tea)
## [1] 300 36
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 +60 :38 +2/day :127
## middle :40 sportsman :179 15-24:92 1 to 2/week: 44
## non-worker :64 25-34:69 1/day : 95
## other worker:20 35-44:40 3 to 6/week: 34
## senior :35 45-59:61
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming exciting
## feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
## Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
##
##
##
##
##
## relaxing effect.on.health
## No.relaxing:113 effect on health : 66
## relaxing :187 No.effect on health:234
##
##
##
##
##
# The code below prints bar plots for each variable
# uncomment to run
#factor_variable_names <- names(Filter(is.factor, tea))
#for (variable in factor_variable_names) {
# barplot(table(tea[[variable]]), main = variable, col = "skyblue", cex.main = 0.8)
#}
Let’s filter the data to keep only those columns that relate to tea drinking times and one for sex.
# keep only five columns: sex and those related to tea consumption times.
tea_when <- tea[c("breakfast","tea.time","lunch","dinner", "sex")]
# multiple correspondence analysis
mca <- MCA(tea_when, graph = F)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_when, graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Variance 0.303 0.214 0.185 0.161 0.136
## % of var. 30.276 21.425 18.542 16.147 13.610
## Cumulative % of var. 30.276 51.701 70.244 86.390 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | 0.414 0.189 0.210 | 0.736 0.842 0.663 | -0.268 0.129
## 2 | -0.019 0.000 0.001 | 0.212 0.070 0.068 | -0.346 0.215
## 3 | 0.598 0.394 0.113 | -1.029 1.649 0.334 | 0.250 0.112
## 4 | 1.548 2.639 0.700 | -0.298 0.138 0.026 | 0.412 0.306
## 5 | 0.414 0.189 0.210 | 0.736 0.842 0.663 | -0.268 0.129
## 6 | 1.548 2.639 0.700 | -0.298 0.138 0.026 | 0.412 0.306
## 7 | -0.103 0.012 0.015 | 0.528 0.433 0.390 | -0.353 0.224
## 8 | -0.276 0.084 0.145 | -0.626 0.610 0.745 | -0.102 0.019
## 9 | -0.103 0.012 0.015 | 0.528 0.433 0.390 | -0.353 0.224
## 10 | 0.414 0.189 0.210 | 0.736 0.842 0.663 | -0.268 0.129
## cos2
## 1 0.088 |
## 2 0.181 |
## 3 0.020 |
## 4 0.050 |
## 5 0.088 |
## 6 0.050 |
## 7 0.175 |
## 8 0.020 |
## 9 0.175 |
## 10 0.088 |
##
## Categories
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## breakfast | -0.371 4.367 0.127 -6.166 | 0.758 25.768 0.531
## Not.breakfast | 0.343 4.032 0.127 6.166 | -0.700 23.786 0.531
## Not.tea time | 0.802 18.541 0.498 12.205 | 0.271 3.003 0.057
## tea time | -0.621 14.372 0.498 -12.205 | -0.210 2.328 0.057
## lunch | -0.994 9.568 0.170 -7.124 | 0.648 5.753 0.072
## Not.lunch | 0.171 1.645 0.170 7.124 | -0.111 0.989 0.072
## dinner | 2.238 23.156 0.377 10.616 | -0.868 4.922 0.057
## Not.dinner | -0.168 1.743 0.377 -10.616 | 0.065 0.370 0.057
## F | -0.484 9.181 0.342 -10.109 | -0.493 13.453 0.354
## M | 0.706 13.396 0.342 10.109 | 0.719 19.629 0.354
## v.test Dim.3 ctr cos2 v.test
## breakfast 12.599 | -0.368 6.996 0.125 -6.107 |
## Not.breakfast -12.599 | 0.339 6.458 0.125 6.107 |
## Not.tea time 4.132 | 0.103 0.497 0.008 1.564 |
## tea time -4.132 | -0.080 0.385 0.008 -1.564 |
## lunch 4.647 | 2.089 69.026 0.750 14.975 |
## Not.lunch -4.647 | -0.359 11.864 0.750 -14.975 |
## dinner -4.117 | 0.705 3.758 0.037 3.347 |
## Not.dinner 4.117 | -0.053 0.283 0.037 -3.347 |
## F -10.294 | -0.068 0.298 0.007 -1.425 |
## M 10.294 | 0.100 0.435 0.007 1.425 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.127 0.531 0.125 |
## tea.time | 0.498 0.057 0.008 |
## lunch | 0.170 0.072 0.750 |
## dinner | 0.377 0.057 0.037 |
## sex | 0.342 0.354 0.007 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")
What does this plot tell us? I’m not quite sure how to interpret it, to be honest. At least we saw previously that few people drink tea with dinner – those that do seem to be the outliers in this plot, too.
The final week sees us delving into longitudinal data. It’s common to have repeated measures data not only in behavioural science, the context of this course, but also in other social and natural sciences. This week shows a few approaches for visualizing and exploratively & formally analyzing such data.
The RATS data includes measurements of 16 rats’ body weights, taken roughly once a week over a nine-week period – 11 measurements in total. The data originates from ‘Practical longitudinal data analysis’ by David Hand and Martin Crowder (1996), and has been shared in many R datasets. See for example a description here. The rats were divided into three groups based on their diet.
The data has been transformed to long form with this script.
# necessary libraries
library(tibble)
#library(GGally)
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
RATSL <- read_csv("data/RATSL.csv")
## Rows: 176 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): WD
## dbl (4): ID, Group, Weight, Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
dim(RATSL)
## [1] 176 5
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <dbl> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
There’s a total of 176 measurements for these 16 rats. Let’s plot the weight curves of each of them.
ggplot(RATSL, aes(x = Time, y = Weight, group = ID)) +
geom_line(aes(linetype = Group)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "Weight (grams)") +
theme(legend.position = "top")
Groups 2 and 3 seem visually about the same, whereas the rats in group 1 are several hundred grams lighter and do not gain much weight over the measurement period.
Let’s also plot each group and rat in them separately to better visualize possible individual and group differences.
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
We see at least that there’s a potential outlier in group 2 and that group 3 might otherwise have a slightly higher mean weight. Such an interpretation cannot be confirmed with just line visualizations, though.
Let’s therefore create some summary graphs that show the mean values and standards errors of each group at each time point.
RATSS <- RATSL %>%
group_by(Group, Time) %>%
reframe( mean = mean(Weight), se = sd(Weight)/sqrt(length((Weight))) ) %>%
ungroup()
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
#theme(legend.position = c(0.8,0.8)) +
scale_y_continuous(name = "mean(weight) +/- se(weight)")
And boxplots of the individual mean weights.
RATSSI <- RATSL %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(RATSSI, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "Mean(weight)")
A few data points lie outside the boxes, but the most obvious one is in group 2, spotted previously. Let’s filter it out as an outlier.
RATSSI <- RATSL %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup() %>%
filter(mean < 550)
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(RATSSI, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "Mean(weight)")
Dropping the high value in group 2 could be challenged as a decision, especially as there are rather few observations in each group to begin with. Nonetheless, without it, the groups are visually distinct from each other.
Finally, we should formally test for group differences. Since there are three groups, we should fit an analysis of variance model (ANOVA). This tests for overall differences, after which Tukey’s HSD test is used to find which of the groups actually differ.
anova_model <- aov(mean ~ Group, data = RATSSI)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 206390 103195 483.6 3.39e-12 ***
## Residuals 12 2561 213
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
print(TukeyHSD(anova_model))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mean ~ Group, data = RATSSI)
##
## $Group
## diff lwr upr p adj
## 2-1 185.73864 159.35462 212.1227 0.00e+00
## 3-1 262.07955 238.21430 285.9448 0.00e+00
## 3-2 76.34091 46.57572 106.1061 4.94e-05
The groups are different – all three differ significantly from each other.
However different they are, the diets (Groups) do not seem to have a significant effect on the weights once the baseline is accounted for. The starting point, i.e. the first weighing, explains the outcome best. This is tested by fitting a linear model with the mean weight as the response and the baseline as a covariate, followed by computing an analysis of variance table. Notice that Group is not significant at the 0.05 level.
# remove the first measurement, exclude it from mean calc, treat it as baseline
RATSLF <- RATSL %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
#filter(mean < 550)
# add the baseline back
RATSLF <- RATSLF %>%
mutate(baseline = RATS$WD1)
# Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline+Group, data = RATSLF)
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.8201 1.57e-14 ***
## Group 2 879 439 3.2219 0.07586 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Next, we’ll examine psychiatric treatment scores of 40 subjects over 8 weeks. The target variable is the Brief Psychiatric Rating Scale (BPRS from now on). For the purposes of this course diary, it’s enough to know that lower = better when it comes to BPRS.
The data has been transformed to long form with this script.
BPRSL <- read_csv("data/BPRSL.csv")
## Rows: 360 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): weeks
## dbl (4): treatment, subject, bprs, week
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Factor treatment & subject
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)
dim(BPRSL)
## [1] 360 5
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs <dbl> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
The data has 360 observations by 40 individuals. Let’s plot them separated by treatment group:
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
There’s a lot of individual variability. Overall, the scores seem to decrease over time, but some individuals differ in this case, too.
Moreover, the treatment groups don’t differ much visually. We can observe this by plotting them on the same graph:
# ggplot can't distinguish the individuals when treatment and subject are not clearly separated
# this fixes it by creating a new column subject_id that combines treatment and subject with a hyphen
BPRSL <- BPRSL %>%
mutate(subject_id = paste0(as.character(treatment), "-", as.character(subject)))
ggplot(BPRSL, aes(x = week, y = bprs, group = subject_id)) +
geom_line(aes(linetype = treatment)) +
#scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "BPRS") +
theme(legend.position = "top")
Could we model the data with ordinary linear models? Here’s what you shouldn’t do when the observations are not independent – they are temporally autocorrelated. In other words, previous observations predict the succeeding observations quite well.
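Temporal autocorrelation is easy to illustrate with simulated data: in an AR(1) process each value is a fraction of the previous one plus noise, and the lag-1 autocorrelation stays close to that fraction – exactly the kind of dependence that violates the independence assumption of an ordinary linear model. A minimal sketch (simulated series, not the BPRS data):

```r
set.seed(1)
# AR(1) process: each observation is 0.8 * previous + white noise
y <- as.numeric(arima.sim(model = list(ar = 0.8), n = 200))
# lag-1 autocorrelation is far from zero, so successive values are not independent
acf(y, lag.max = 1, plot = FALSE)$acf[2]
```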
# create a regression model
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
# print out a summary of the model
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
The treatment is not significant, and the model explains the phenomenon really poorly in any case (adjusted R-squared: 0.1806).
A mixed effects model would suit this case better. With a random intercept model, the intercept (the value of the target when all explanatory variables are zero) can differ between subjects.
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
anova(BPRS_ref, BPRS_reg)
## Data: BPRSL
## Models:
## BPRS_reg: bprs ~ week + treatment
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_reg 4 2837.3 2852.9 -1414.7 2829.3
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7 90.624 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Check out the ANOVA: the random intercept model is a significant improvement over the naive linear model (AIC and BIC are lower). Treatment is still not a significant explanatory variable.
Instead of the intercept, the slope of each subject can also be allowed to vary; the slope is the linear change in bprs per week.
BPRS_rsm <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRS_rsm)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
# perform an ANOVA test on the two models
anova(BPRS_ref, BPRS_rsm, BPRS_reg)
## Data: BPRSL
## Models:
## BPRS_reg: bprs ~ week + treatment
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_rsm: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_reg 4 2837.3 2852.9 -1414.7 2829.3
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7 90.6245 1 < 2e-16 ***
## BPRS_rsm 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The random slope model is a significant improvement over the random intercept model. This suggests that subjects differ not only in their baseline BPRS but also in their rate of change over time.
Next, let's combine the random intercept and slope with an interaction between time and treatment to create one final model:
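To see those per-subject differences directly, the estimated random effects can be extracted with `ranef()` from lme4; each subject gets its own intercept and week-slope deviation from the fixed effects:

```r
# Per-subject deviations from the fixed intercept and week slope
head(ranef(BPRS_rsm)$subject)
```

Subjects with a strongly negative week deviation improve faster than average, and vice versa.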
BPRS_rism <- lmer(bprs ~ week + treatment + (week | subject) + week * treatment, data = BPRSL, REML = FALSE)
summary(BPRS_rism)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject) + week * treatment
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
# Compare all four models at once
anova(BPRS_rism, BPRS_ref, BPRS_rsm, BPRS_reg)
## Data: BPRSL
## Models:
## BPRS_reg: bprs ~ week + treatment
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_rsm: bprs ~ week + treatment + (week | subject)
## BPRS_rism: bprs ~ week + treatment + (week | subject) + week * treatment
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_reg 4 2837.3 2852.9 -1414.7 2829.3
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7 90.6245 1 < 2e-16 ***
## BPRS_rsm 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## BPRS_rism 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
However, this model is not a significant improvement over the random slope model (p ≈ 0.075), although it still has (just barely) the lowest AIC. Let's therefore fit BPRS values with it and visualize them alongside the observations.
# Plot the observed BPRS trajectories, one line per subject
ggplot(BPRSL, aes(x = week, y = bprs, group = subject_id)) +
  geom_line(aes(linetype = treatment)) +
  scale_y_continuous(name = "Observed BPRS") +
  theme(legend.position = "top")
# Extract the fitted values from the final model and attach them to the data
Fitted <- fitted(BPRS_rism)
BPRSL$Fitted <- Fitted
# Plot the fitted trajectories for comparison
ggplot(BPRSL, aes(x = week, y = Fitted, group = subject_id)) +
  geom_line(aes(linetype = treatment)) +
  scale_y_continuous(name = "Fitted BPRS") +
  theme(legend.position = "top")
Hmm, quite crude approximations. I wonder if a linear model, even one taking random effects into account, is the best approach in this case. Could a higher-order model capture the scores better?
Nonetheless, there do not seem to be differences between the treatment groups.
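One way to probe that higher-order idea would be to add a quadratic term for week. This is only a sketch under the same setup as above (the model name `BPRS_quad` is mine, not from the course material):

```r
# Sketch: allow curvature in the trajectories with a quadratic week term;
# I(week^2) keeps the squared term as-is inside the formula
BPRS_quad <- lmer(bprs ~ week + I(week^2) + treatment + (week | subject),
                  data = BPRSL, REML = FALSE)
# Likelihood ratio test against the random slope model
anova(BPRS_rsm, BPRS_quad)
```

If the quadratic model's chi-squared test were significant, that would support the hunch that the decline in BPRS flattens out over time rather than being strictly linear.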